Data Science on Blockchain, how to get started?

Two analysis examples on NFT and Helium

Thomas de Marchin

14DEC2022

Introduction

  • Associate Director, Statistics & Data Science at Pharmalex: support pharmaceutical companies in drug development
  • Gravitated towards blockchain technology and the place of data science within it
  • Lots of data available on the blockchain (~ 1x10^6 transaction per day on ethereum), how to access it?
  • Obtaining cryptocurrencies price data is straightforward, more sophisticated data manipulations require access to the “source” data
  • Github: https://github.com/tdemarchin
  • Blog: https://tdemarchin.medium.com/

Blockchain is accessible but hard to read

  1. Several Tb of data. Ethereum ~ 6x10^9 transactions.

  2. Data are stored sequentially, requires developing specific tools to follow a transaction.

  3. The structure of a transaction is difficult to read

  4. Fragmentation of blockchain technologies

A huge hard drive

Image from ig.com

Structure of a transaction

How to get the data ?

  1. Set-up an ETL: Extract, Transform and Load. Most flexible but need to set up a server with huge and fast hard-drives.
  2. Use data providers:
  • Dashboards: Dune analytics, Nansen, GraphSense, icy.tools,…
  • APIs (software intermediary that allows two applications to talk to each other): OpenSea, Etherscan, Infura, The Graph,…

Two examples implemented in R (lots of visualisation packages):

  1. How to track and visualize NFT transactions: OpenSea and EtherScan APIs
  2. Blockchain IoT data visualization and the rise of the Helium network: full data dump from The Helium Foundation ETL

NFTs

  • Non-Fungible Tokens: represent ownership of unique items (art, collectibles, patents, real estate,…)
  • Smart contracts: decentralized programs stored on a blockchain that run when predetermined conditions are met
  • What can we do with NFT-related data ?

NFTs price analysis with OpenSea API

OpenSea: big NFT market place

resOpenSea <- GET("https://api.opensea.io/api/v1/events",
          query = list(limit = 300, 
                       event_type = "successful", 
                       only_opensea = "true")) 

  • 300 transactions max, limited to OpenSea transactions
  • Pre-processed by OpenSea and not raw blockchain transactions
  • Mainly price analysis
  • Analysis done in June 2021: Link

NFTs tracking with EtherScan API: the data

https://weirdwhalesnft.com/

  • Weird Whales: collection of 3350 whales programmatically generated, each with their unique characteristics and traits. Created by a 12-year-old programmer named Benyamin Ahmed who made the buzz.
  • EtherScan: block explorer to view information about transactions, verify contract code, visualize network data \(\rightarrow\) access to more raw data
  • Analysis done in October 2021 and updated in December 2022: Link
  • Weird Whales are managed by a specific smart contract on the Ethereum blockchain

  • To make it easier to extract information from the blockchain, we can read the events: dispatched signals (easy to read) the smart contracts can fire.

resEventTransfer <- GET("https://api.etherscan.io/api",
                          query = list(module = "logs", 
                                       action = "getLogs", 
                                       fromBlock = fromBlock, 
                                       toBlock = "latest",
                                       address = "0x96ed81c7f4406eff359e27bff6325dc3c9e042bd", 
                                       topic0 = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
                                       apikey = EtherScanAPIToken)) 

Where is the sales price? On OpenSea, sales are managed by the main contract and if approved, the second contract is called (here Weird Whales), which then triggers the transfer \(\rightarrow\) need to download all the transactions from the OpenSea main smart contract address and then filter for the ones related to Weird Whales (~ 10000 API calls, can take several hours…).

  • Ethereum / USD rate is highly volatile, need to obtain the historical ETH market price if we want to convert ETH to USD.
  • A spline is fitted on data from the CoinGecko API

NFTs tracking with EtherScan API: analysis and visualisation

  • High variability in prices at the begining (buzz), followed by a quieter period.
  • Some NFTs are exchanged outside the OpenSea Market (price = 0) \(\rightarrow\) difficult to trace….
  • Starting March 2022, all sales price felt to 0 \(\rightarrow\) OpenSea was hacked in March, they probably updated their smart contract…
  • Perfect data to be plotted as a network. We will display all the wallet addresses that have ever traded Weird Whales (restricted to tokens involved in at least 10 transactions).

  • Networks are described by:

    • Vertices (or nodes): the wallet addresses
    • Edges (or links): the transactions
  • network package:

# create the network using network function
network <- network(edges,  
                   vertex.attr = vertices, 
                   matrix.type = "edgelist", 
                   loops = T, 
                   multiple = T, 
                   ignore.eval = F)

  • Each color represents a unique token ID
  • All transactions of all the tokens originate from the single minting address (1)
  • Some addresses are involved in multiple transactions (curved edges)
  • Some tokens were transferred to an address only to be sent back to the sender…?
  • Made with the ggraph packages

About 2/3 of the transactions happened very shortly after the NFT’s creation

  • Made with the network and networkDynamic packages

Helium IoT network

 

Helium, the people’s network. Photo from Nima Mot.

 

A miner

 

Rewards

 

  • Helium: decentralized wireless infrastructure for IoT devices (environmental sensors, localisation sensors to track bike fleets,…), blockchain that leverages a decentralized global network of Hotspots
  • People are incentivized to install hotspots and become a part of the network by earning Helium tokens
  • Lot of content available about the coverage aspect of Helium, but what about the real use ? :
    • How big is the Helium network, where are located the hotspots?
    • Are they actively utilized, i.e. are they used to transfer data with connected devices?
  • Analysis done in March 2022: Link
  • Dewi ETL project: dedicated ETL server up and running. Interface (metabase) available to download extracts but limited to 10^6 rows. Fortunately, the team put CSV extracts in 10k/50k-block increments on a server: centralized, bad practice!

Helium IoT network: hotspot data

Liege

  • H3: geospatial indexing system using a hierarchical hexagonal grid. At resolution 8: earth is covered by 691,776,122 hexagons \(\rightarrow\) we convert to lat/lng

  • Some owners own more than 50 hotspots!
  • Three phases: (1) a slow linear increase, (2) an exponential increase in the middle of 2021 followed by (3) a fast linear increase (did it continue?)

Cumulative growth of the network infrastructure in terms of number of hotspots added

  • We start by creating an empty world map on which we overlay the hotspot data.
  • Plotting all the individual hotspots on a map would be too much (> 500k hotspots) \(\rightarrow\) we cluster the hotspots into hexagons (hexbin and geom_hex)
  • Most hotspots are located in North America, Europe and Asia, mostly in big cities. Practically no hotspots in Africa, Russia and very few in South America. A few hotspots in the middle of the ocean… Data issue or cheating to increase rewards?

Helium IoT network: network usage

  • We understood that hotspots are distributed on the planet and provide good coverage in big cities. Are they being actively used by connected devices and how often? \(\rightarrow\) Transaction data
  • Every 24 bytes sent cost 1 Data Credit (DC) = $0.00001
  • We can look at network usage from two perspectives: (1) check the volume of data exchanged and (2) check how often the hotspots have been involved in data transfer with connected devices

Total volume of data exchanged

Cumulative sum of the number of transactions

  • Slow linear increase followed by an exponential increase, which is finally followed by a fast linear increase
  • Glitch in November 2021: major outage of the blockchain
  • Despite having about 15% of the hotspots, Asia don’t seem to be so active in terms of data transfer
  • Median number of transactions per hotspot: 42
  • 40.48% hotspots did not participate in any transaction so far.

Evolution of the number of transactions across the globe

  • Transactions mainly occur in North America before mid 2020, then followed by a strong wave in Europe and Asia. Barely no transaction have occurred in South America and Africa.
  • Made with the gganimate package

Distribution of the transactions in the US

Distribution of the transactions in Belgium

  • US: Transactions are homogeneously distributed across the country although the peaks of activity (note that the legend is logarithmic!) are located around big cities (New York, Los Angeles, San Francisco, Miami)
  • BE: Transactions are not homogeneously distributed across the country \(\rightarrow\) most transactions happen in the upper part of the country, which is consistent with the lower part of the country being scarcely populated (Ardennes)
  • Made with the rayshader package, interactive version available in the original article

Conclusion

  • Blockchain: approaches for reading the blockchains and on how to visualize the blockchain transactions
  • NFT: We have shown an example of how to download and plot a network of transactions associated to NFTs. Analysing and plotting transactions is easy once you have the data. On the other hand, obtaining data in an appropriate format requires a deep understanding of the blockchain, especially when working with smart contracts.
  • Helium: We have shown some techniques on how to summarise and visualise a network growth in term of infrastructure (hotspots) and data usage (transactions). We have analysed spatio-temporal data and have plotted them using dedicated R packages.

Next step: analyse the networks using the graph theory